Abstract

In this paper, we explore the consequences of a distinction between 'live' and 'dead' 
network nodes; 'live' nodes are able to acquire new links whereas 'dead' nodes are 
static. We develop an analytically soluble growing network model incorporating this 
distinction and show that it can provide a quantitative description of the empirical net- 
work composed of citations and references (in- and out-links) between papers (nodes) 
in the SPIRES database of scientific papers in high energy physics. We also demonstrate 
that the death mechanism alone can result in power law degree distributions for the 
resulting network. 



1 Introduction 

The study and modeling of complex networks has expanded rapidly in the 
new millennium and is now firmly established as a science in its own right 
(Watts, 1999; Albert & Barabási, 2002; Dorogovtsev & Mendes, 2002; Newman, 2003).



One of the oldest examples of a large complex network is the network of
citations and references (in- and out-links) between scientific papers (nodes)
(de Solla Price, 1965; Redner, 1998; Lehmann et al., 2003; Lehmann et al., 2005;
Redner, 2004). A very successful model describing networks with power-law
degree distributions is based on the notion of preferential attachment. The
principles underlying this model were first introduced by Simon (Simon, 1957),
applied to citation networks by de Solla Price (de Solla Price, 1976) 1, and
independently rediscovered by Barabási and Albert (Barabási & Albert, 1999).
Various modifications of the preferential attachment model have appeared more
recently. In the present context, the key papers on preferential attachment are
(Lehmann et al., 2003; Lehmann et al., 2005; Krapivsky et al., 2000;
Krapivsky & Redner, 2001; Klemm & Eguíluz, 2002). Simplicity is both the
primary strength and the primary weakness of the preferential attachment model.
For example, preferential attachment models tend to assume that networks are
homogeneous. When networks have significant and identifiable inhomogeneities
(as is the case for the citation network), the data can require augmentation of
the preferential attachment model to account for them.

The primary conclusion of Ref. (Lehmann et al., 2003) is that the majority
of nodes in a citation network 'die' after a short time, never to be cited again.
A small population of papers remains 'alive' and continues to be cited many
years after publication. In Ref. (Lehmann et al., 2005) it was established that
this distinction between live and dead papers is an important inhomogeneity
in the citation network that is not accounted for by the simple preferential
attachment model. Interestingly, a similar distinction between live and dead
nodes was recently and independently suggested by Redner (Redner, 2004). In
this paper, we will explore how the distinction between live and dead papers
manifests itself in network models and thus suggest an extension of the
preferential attachment model.



2 The SPIRES data 

The work in this paper is based on data obtained from the SPIRES 2 database of
papers in high energy physics. More specifically, our dataset is the network of
all citable papers from the theory subfield as of the end of October 2003. After
filtering out all papers for which no publication date is available and
removing all references to papers not in SPIRES, a final network of 275 665
nodes and 3 434 175 edges remains.

1 More precisely, de Solla Price was the first person to re-think Simon's model and use it as a
basis of description for any kind of network, cf. (Newman, 2003).

2 SPIRES is an acronym for 'Stanford Physics Information REtrieval System' and is the oldest
computerized database in the world. The SPIRES staff has been cataloguing all significant papers
in high energy physics and their lists of references since 1974. The database is open to the public
and can be found at http://www.slac.stanford.edu/spires/

Above we described a dead node as one that no longer receives citations, 
but how does one define a dead node in real data? We have tested several def- 
initions, and the results are qualitatively independent of the definition chosen. 
Therefore, we can simply define live papers as papers cited in 2003. While 
we acknowledge the existence of papers that receive citations after a long dor- 
mant period, such cases are rare and do not affect the large scale statistics. In 
Figure 2 the (normalized) degree distributions of live and dead papers in the
SPIRES data are plotted, and it is clear that the two distributions differ
significantly. Having isolated the dead papers, we are not only able to plot
them; we can also determine the empirical ratio of live to dead papers as a
function of the number of citations per paper, k. In Figure 1 this ratio is
displayed with k ranging from 1 to 150. (Papers with zero citations are dead by
definition.) Over most of this range, the data is well described by a straight
line. Note that the data for dead papers with high citation counts is very
sparse. For example, only 0.15% of the dead papers have more than 100 citations,
so the statistics beyond this point are highly unreliable. More generally, a
linear plot of the ratio of live to dead papers provides a pessimistic
representation of the data. We therefore conclude that the ratio of dead to live
papers is relatively well described by the simple form 1/(k + 1) for all but the
largest values of k, for which the number of dead papers is overestimated by a
factor of two to three. In the following section, we will make use of this
relation to extend the preferential attachment model to include dead nodes.

Figure 1: The ratio of live to dead papers as a function of k. Error bars are
calculated from the square roots of the citation counts in each bin. A straight
line is included to illustrate the linear relationship between the live and dead
populations for low values of k.
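The binning behind the live-to-dead ratio can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `live_dead_ratio`, the input format (pairs of citation count and a live flag), and the Poisson error estimate are all assumptions, and the data are synthetic, built so that the ratio equals k + 1 exactly.

```python
from collections import Counter
from math import sqrt

def live_dead_ratio(papers):
    """Return {k: (ratio, error)} with ratio = (# live) / (# dead) at k citations.

    `papers` is an iterable of (citations, is_live) pairs. Bins that do not
    contain both live and dead papers are skipped.
    """
    live, dead = Counter(), Counter()
    for k, is_live in papers:
        (live if is_live else dead)[k] += 1
    result = {}
    for k in sorted(set(live) & set(dead)):
        ratio = live[k] / dead[k]
        # Assumed error model: independent Poisson counts in each bin.
        error = ratio * sqrt(1.0 / live[k] + 1.0 / dead[k])
        result[k] = (ratio, error)
    return result

# Synthetic example constructed so that live/dead = k + 1 in every bin.
papers = []
for k in range(1, 6):
    papers += [(k, True)] * (10 * (k + 1)) + [(k, False)] * 10
ratios = live_dead_ratio(papers)
```

With real data, a paper would be flagged live if it was cited in 2003, following the definition adopted above.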

3 The Model 

The basic elements of the preferential attachment model are growth and
preferential attachment (Barabási & Albert, 1999). The simplest model starts out
with a number of initial nodes and, at each update, a new node is added to the
database. Each new node has m out-links that connect to the nodes already
in the database. Each new node enters with k = 0 real in-links. This is the
growth element of the model. Note that, since we have chosen to eliminate all
references to papers not in SPIRES from the dataset, there is a sum rule such
that the average number of citations per paper is also m. Preferential attachment
enters the model through the assumption that the probability for a given node
already in the database to receive one of the m new in-links is proportional to
its current number of in-links. In order for the newest nodes (with k = 0
in-links) to be able to begin attracting new citations, we load each node into
the database with $k_0 = 1$ 'ghost' in-links that can be subtracted after running
the model. The probability of acquiring new citations is proportional to the
total number of in-links, both real and ghost in-links.

One of the simplest ways to generalize the simple incarnation of the
preferential attachment model described above is to regard $k_0$ as a free
parameter. This allows us to estimate when the effects of preferential
attachment become important. Since there is no a priori reason why a paper with
2 citations (in-links) should have a significant advantage over a paper with
1 citation, it is preferable to let the data decide. Thus, in our model, the
probability that a live paper with k citations acquires a new citation at each
time step is proportional to $k + k_0$ with $k_0 > 0$. Also, note that we can
think of the displacement $k_0$ as a way to interpolate between full
preferential attachment ($k_0 = 1$) and no preferential attachment
($k_0 \to \infty$).
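The interpolation role of the displacement can be made concrete with a few lines of code. The sketch below is illustrative only: the function name `attachment_probabilities` and the three citation counts are hypothetical, and the only ingredient taken from the text is the weight k + k_0.

```python
def attachment_probabilities(citations, k0):
    """Probability that each live paper receives the next citation.

    Each paper is weighted by k + k0, which requires k0 > 0 so that
    uncited papers can be chosen.
    """
    weights = [k + k0 for k in citations]
    total = sum(weights)
    return [w / total for w in weights]

citations = [0, 1, 3]                                        # hypothetical in-link counts
full_pa = attachment_probabilities(citations, k0=1)          # weights 1, 2, 4
nearly_uniform = attachment_probabilities(citations, k0=1e6)  # weights almost equal
```

For k_0 = 1 the most cited paper is four times as likely to be chosen as the uncited one, whereas for very large k_0 the three probabilities are practically identical.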

The significant extension of the simple model to be considered here is that,
in our model, each paper has some probability of dying at every time step. From
Section 2 we have a very good idea of what this probability should be: Figure 1
shows us that for a paper with k citations, this probability is proportional to
1/(k + 1) to a reasonable approximation. With this qualitative description of
the model in hand, we proceed to its solution.
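The model described in this section can also be simulated directly. The following is a rough Monte Carlo sketch under stated assumptions, not the procedure used in the paper: the function name `simulate` is hypothetical, all m references of a new paper are drawn with replacement from weights fixed at the start of the step, and the per-step death probability is taken as b / ((k + 1) n), where n is the current number of papers. That 1/n scaling is an interpretation chosen so that the number of deaths per added paper stays of order one.

```python
import random

def simulate(n_papers, m, k0, b, seed=0):
    """Grow a citation network with live and dead papers.

    Returns two lists: citation counts of live papers and of dead papers.
    """
    rng = random.Random(seed)
    live, dead = [0], []            # start from a single uncited live paper
    for n in range(1, n_papers):
        # Each of the m references picks a live paper with weight k + k0.
        weights = [k + k0 for k in live]
        for target in rng.choices(range(len(live)), weights=weights, k=m):
            live[target] += 1
        # Assumed death rule: probability b / ((k + 1) * n), capped at 1.
        survivors = []
        for k in live:
            if rng.random() < min(1.0, b / ((k + 1) * n)):
                dead.append(k)
            else:
                survivors.append(k)
        live = survivors
        live.append(0)              # the new paper enters live and uncited
    return live, dead

live, dead = simulate(n_papers=300, m=5, k0=2.0, b=3.0)
```

Because dead papers are removed from the pool of possible targets, they can never be cited again, which is the defining property of a dead node. The quadratic cost of this loop makes it suitable only for small illustrative networks.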



4 Rate Equations 

One very powerful method for solving preferential attachment network models
is the rate equation approach, introduced in the context of networks by
(Krapivsky et al., 2000). Let $L_k$ and $D_k$ be the respective probabilities of
finding a live or a dead paper with k real citations. As explained above, we
load each paper into the database with k = 0 real citations and m references.
The rate equations become

$$L_k = m(\lambda_{k-1} L_{k-1} - \lambda_k L_k) - \eta_k L_k + \delta_{k,0} \tag{1}$$
$$D_k = \eta_k L_k, \tag{2}$$

where $\lambda_k$ and $\eta_k$ are rate constants. Since every paper has a finite
number of citations, the probabilities $L_k$ and $D_k$ become exactly zero for
sufficiently large k; we also define $L_k$ to be zero for k < 0. In this way,
all sums can run from k = 0 to infinity. These equations trivially satisfy the
normalization condition

$$\sum_k (L_k + D_k) = 1, \tag{3}$$

for any choice of $\eta_k$ and $\lambda_k$. However, we also demand that the mean
number of citations is equal to the mean number of references,

$$\sum_k k (L_k + D_k) = m. \tag{4}$$

This constraint must be imposed by an overall scaling of $\eta_k$ and $\lambda_k$.
The model described in Section 3 corresponds to a choice of $\eta_k$ and
$\lambda_k$, where

$$m\lambda_k = a(k + k_0) \tag{5}$$

is the preferential attachment term and

$$\eta_k = \frac{b}{k + 1} \tag{6}$$

corresponds to the previously described death mechanism. We insert Equations
(5) and (6) into Equation (1) and perform the recursion to find

$$L_k = \frac{\Gamma(k + 2)\,\Gamma(k + k_0)}{a k_1 k_2\,\Gamma(k_0)}\,\frac{\Gamma(1 - k_1)}{\Gamma(k - k_1 + 1)}\,\frac{\Gamma(1 - k_2)}{\Gamma(k - k_2 + 1)}, \tag{7}$$

and of course $D_k = b L_k / (k + 1)$. The two new constants, $k_1$ and $k_2$, are
solutions to the quadratic equation

$$(a(k + k_0) + 1)(k + 1) + b = 0 \tag{8}$$

as a function of k.
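The closed form can be checked numerically against a direct iteration of the rate equations. The sketch below assumes the rates of Equations (5) and (6) and real roots of the quadratic in Equation (8); the helper names `roots`, `live_by_recursion`, and `live_closed_form` are hypothetical, and the parameter values are the fit values quoted in Section 6, used here purely for illustration.

```python
from math import exp, lgamma, sqrt

def roots(a, k0, b):
    """Roots k1 <= k2 of (a (k + k0) + 1)(k + 1) + b = 0, assumed real."""
    B = a * k0 + a + 1.0
    C = a * k0 + 1.0 + b
    disc = sqrt(B * B - 4.0 * a * C)
    return (-B - disc) / (2.0 * a), (-B + disc) / (2.0 * a)

def live_by_recursion(a, k0, b, kmax):
    """Iterate Equation (1) with Equations (5) and (6) inserted."""
    L = [1.0 / (1.0 + a * k0 + b)]          # k = 0: only the source term feeds L_0
    for k in range(1, kmax + 1):
        gain = a * (k - 1 + k0) * L[-1]
        L.append(gain / (1.0 + a * (k + k0) + b / (k + 1)))
    return L

def live_closed_form(a, k0, b, k):
    """Equation (7), evaluated through log-gamma functions."""
    k1, k2 = roots(a, k0, b)
    log_l = (lgamma(k + 2) + lgamma(k + k0) - lgamma(k0)
             + lgamma(1 - k1) - lgamma(k - k1 + 1)
             + lgamma(1 - k2) - lgamma(k - k2 + 1))
    return exp(log_l) / (a * k1 * k2)

a, k0, b = 0.436, 65.6, 12.4
L = live_by_recursion(a, k0, b, kmax=200)
max_rel_diff = max(abs(L[k] - live_closed_form(a, k0, b, k)) / L[k]
                   for k in (0, 1, 10, 200))
```

Working with log-gamma values avoids overflow in the individual gamma functions, whose arguments grow with both k and k_0.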



5 The $k_0 \to \infty$ Limit

Before moving on, let us explore the limit where $k_0 \to \infty$ and preferential
attachment is turned off. In this regime, the network is, of course, completely
dominated by the death mechanism. We can either obtain this limit by again
solving Equations (1) and (2) with $\lambda_k$ = constant and
$\eta_k = b/(k + 1)$, or we can make the more elegant replacement
$a = \alpha/k_0$ in Equation (7), and then take the limit $k_0 \to \infty$ for
fixed $\alpha$. The two approaches are equivalent. We find

$$L_k = \frac{1}{\alpha + 1}\left(\frac{\alpha}{\alpha + 1}\right)^{k} \frac{\Gamma(k + 2)\,\Gamma(r + 1)}{\Gamma(k + r + 2)}, \qquad r = \frac{b}{\alpha + 1}, \tag{9}$$

and the $D_k$ are still simply $b L_k / (k + 1)$. With this expression for $L_k$,
let us consider the limit of $\alpha \to \infty$ and $b \to \infty$ with the ratio
$r = b/(\alpha + 1) \approx b/\alpha$ fixed. In this limit, it is tempting to
replace the term $\alpha/(\alpha + 1)$ by one 3. In this case, the use of
identities, such as

$$\sum_{k=0}^{\infty} \frac{k!}{(k + \gamma)!} = \frac{1}{(\gamma - 1)\,(\gamma - 1)!}, \tag{10}$$

enables us to compute the fraction of dead papers f, and the average numbers
of citations for live and dead papers. The results are simply

$$m_L = \frac{2}{r - 2} \tag{12}$$

$$m_D = \frac{1}{r - 1}, \tag{13}$$

and the average number of citations for all papers is evidently
$m = (1 - f) m_L + f m_D$. The fraction of dead papers is $f \to 1 - O(1/b)$ and
the average number of citations for all papers approaches $m_D$.

3 For present purposes, this is appropriate when r > 2. When r < 2, the neglected factor is
essential for ensuring the convergence of the average number of citations for the live and dead
papers $m_L$ and $m_D$.
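The limiting averages can be checked by iterating the rate equations numerically with a constant attachment rate. This is a sketch under stated assumptions: the function name is hypothetical, the rates are taken as $m\lambda_k = \alpha$ and $\eta_k = b/(k + 1)$, and the values r = 4 and alpha = 2000 are arbitrary illustrative choices with r > 2. At finite alpha the results differ from the limiting values by small corrections.

```python
def means_without_preferential_attachment(alpha, b, kmax):
    """Iterate Equations (1) and (2) with m*lambda_k = alpha and eta_k = b / (k + 1).

    Returns (f, m_L, m_D): the dead fraction and the mean citation counts
    of live and dead papers, with all sums truncated at kmax.
    """
    L = 1.0 / (1.0 + alpha + b)              # L_0
    sum_L = sum_D = sum_kL = sum_kD = 0.0
    for k in range(kmax + 1):
        if k > 0:
            L = alpha * L / (1.0 + alpha + b / (k + 1))
        D = b * L / (k + 1)
        sum_L += L
        sum_D += D
        sum_kL += k * L
        sum_kD += k * D
    return sum_D / (sum_L + sum_D), sum_kL / sum_L, sum_kD / sum_D

r = 4.0
alpha = 2000.0
f, m_L, m_D = means_without_preferential_attachment(alpha, r * (alpha + 1.0), kmax=20000)
```

For these inputs m_L should lie close to 2/(r - 2) = 1 and m_D close to 1/(r - 1) = 1/3, while f is only slightly below one.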

The most important result, however, is that in this limit we find that

$$L_k \sim \frac{1}{k^{r}} \quad \text{and} \quad D_k \sim \frac{1}{k^{r+1}}, \tag{14}$$

where we assume that $k \gg r$. Thus, we see that power law distributions for both
live and dead papers emerge naturally in the limit of $f \to 1$. In the literature, power laws
in the degree distributions of networks are often regarded as an indication that 
preferential attachment has played an essential part in the generation of the 
network in question. It is thus of considerable interest to see an alternative and 
quite different way of obtaining them. 
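The power law exponents can be read off numerically from the local slope on a log-log scale. The sketch below assumes the k-dependence $L_k \propto \Gamma(k + 2)/\Gamma(k + r + 2)$, which follows from iterating Equation (1) with a constant attachment rate once the factor $\alpha/(\alpha + 1)$ is set to one, together with $D_k = b L_k/(k + 1)$. The function names and the choice r = 3 are illustrative.

```python
from math import lgamma, log

def log_live(k, r):
    """ln L_k up to an additive constant, with alpha/(alpha + 1) replaced by one."""
    return lgamma(k + 2) - lgamma(k + r + 2)

def log_dead(k, r):
    """ln D_k up to an additive constant, using D_k = b L_k / (k + 1)."""
    return log_live(k, r) - log(k + 1)

def local_exponent(log_p, k_low, k_high, r):
    """Slope of ln P against ln k between k_low and k_high."""
    return (log_p(k_high, r) - log_p(k_low, r)) / (log(k_high) - log(k_low))

r = 3.0
live_slope = local_exponent(log_live, 1000, 2000, r)
dead_slope = local_exponent(log_dead, 1000, 2000, r)
```

For k much larger than r the two slopes approach -r and -(r + 1), so the dead distribution falls off one power of k faster than the live one.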



6 The Full Model 

Let us now return to the full model and see how it compares to the data from
SPIRES. With all zero-cited papers in the dead category, the data yields the
following average values: $m_L = 34.1$, $m_D = 4.5$ and $m = 12.8$. The fraction
of live papers is $1 - f = 27.0\%$. A least squares fit of Equation (7) to the
distribution of live papers gives the parameters $k_0 = 65.6$, $a = 0.436$, and
$b = 12.4$, with an rms error of only 21%. Although only the live data (the
squares in Figure 2) is fitted, the agreement with the empirical data in
Figures 2 and 3 is quite striking.
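The text does not spell out how the rms error of the fit is computed. One plausible reading, given the later reference to an rms fractional error, is the root-mean-square of the relative deviation between model and data over all populated bins. The sketch below implements that reading only; the function name and the sample numbers are hypothetical.

```python
from math import sqrt

def rms_fractional_error(data, model):
    """Root-mean-square of (model - data) / data over bins with data > 0."""
    terms = [((m - d) / d) ** 2 for d, m in zip(data, model) if d > 0]
    return sqrt(sum(terms) / len(terms))

# Hypothetical bin values: every model value is off by 10% of the data value.
data = [1.0, 2.0, 4.0]
model = [1.1, 1.8, 4.4]
error = rms_fractional_error(data, model)
```

A fractional measure of this kind weights every bin equally on a log-log plot, which matters for distributions that span many orders of magnitude.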

From the model parameters $k_0$, a, and b, we can calculate mean citation numbers
for the fit of 32.9, 4.25, and 12.8 for the live, dead, and total population
respectively; the fraction of live papers is found to be 29.8%. More
interestingly, we learn from the fit that 7.5% of the papers with zero citations
are actually alive. If we assign this fraction of the zero-cited papers to the
live population, we find the following corrected values for the averages: 31.5,
4.6 and 12.5 for the live, dead, and total population respectively; the fraction
of live papers is adjusted to become 29.2%. Again, this is a striking agreement
with the data. There is so little strain in the fit that we could have
determined the model parameters from the empirical values of $m_L$, $m_D$, and f.
Doing this yields only small changes in the model parameters and results in a
description of comparable quality!

Figure 2: Log-log plots of the normalized degree distributions of live and dead
papers. The filled squares represent the live data and the stars represent the
dead data. Both lines are the result of a fit to the live data (filled squares) alone.

Figure 2 reveals that fitting to the live distribution results in systematic
errors for high values of k when we extend the fit to describe the dead papers,
but this is not surprising. Recall the similarly systematic deviations from the
straight line seen in Figure 1. This figure also explains why the fit to the
total distribution shows no deviations from the data for high k-values even
though the total fit includes both live and dead papers: live papers dominate
the total distribution in this regime. The obvious way to fix this problem is
via a small modification of the $\eta_k$. In summary, the full model is able to
fit the distributions of both live and dead papers with remarkable accuracy.

Figure 3: A log-log plot of the normalized degree distribution of all papers (live
plus dead). The points are the data; the fit (solid line) is derived from the fit to
the live papers (filled squares) in Figure 2.

One drawback with regard to the full solution is the relatively impenetrable
expression for $L_k$ in Equation (7); associating any kind of intuition with
the conglomerate of gamma-functions presented there can be difficult. Let us
therefore demonstrate that $L_k$ can be well approximated by a two power law
structure. We begin by noting that, in the limit of large $k_0$ (as is the case
here), the values of $k_1$ and $k_2$ are simply

$$k_1 = -\frac{1}{a} + \frac{b}{a k_0} - k_0 \tag{15}$$

$$k_2 = -1 - \frac{b}{a k_0}. \tag{16}$$

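Now, let us write out only the k-dependent terms in Equation (7) and assign
the remaining terms to a constant, C:

$$L_k = C\,\frac{(k + k_0 - 1)!}{(k - k_1)!}\,\frac{(k + 1)!}{(k - k_2)!} \tag{17}$$

$$\approx C\,\frac{1}{(k + k_0 - 1)^{1 - k_0 - k_1}}\,\frac{1}{(k + 1)^{-(1 + k_2)}} \tag{18}$$

$$\approx C\,\frac{1}{(k + k_0 - 1)^{1 + \frac{1}{a} - \frac{b}{a k_0}}}\,\frac{1}{(k + 1)^{\frac{b}{a k_0}}}. \tag{19}$$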
In Equation (18), we have utilized the fact that

$$\frac{(x + s)!}{x!} \approx x^{s} \tag{20}$$

when $x \to \infty$, and in Equation (19) we have inserted the asymptotic forms of
$k_1$ and $k_2$ from Equations (15) and (16).

This expression for $L_k$ in Equation (19) is only valid for large k and $k_0$, but
it proves to be remarkably accurate even for smaller values of k. With the
asymptotic forms of $k_1$ and $k_2$ inserted, we can explicitly see that the first
power law is largely due to preferential attachment and that the second power
law is exclusively due to the death mechanism. The form for very large k is
unaltered by the parameter b. This is not surprising, since there is a low
probability for highly cited papers to die. We see that the primary role of the
death mechanism in the full model is to add a little extra structure to the
$L_k$ for small k.
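The quality of the two power law approximation can be probed by comparing how much the exact and approximate forms fall between two values of k. The sketch below is illustrative: the function names are hypothetical, the exact shape uses the roots of the quadratic in Equation (8) (assumed real), the approximate shape uses the exponents $1 + 1/a - b/(a k_0)$ and $b/(a k_0)$, and the parameters are the fit values quoted in Section 6.

```python
from math import lgamma, log, sqrt

def exact_log_shape(k, a, k0, b):
    """ln of the k-dependent factors of the exact solution, using the exact roots."""
    B = a * k0 + a + 1.0
    C = a * k0 + 1.0 + b
    disc = sqrt(B * B - 4.0 * a * C)          # assumes real roots
    k1 = (-B - disc) / (2.0 * a)
    k2 = (-B + disc) / (2.0 * a)
    return (lgamma(k + k0) - lgamma(k - k1 + 1)
            + lgamma(k + 2) - lgamma(k - k2 + 1))

def two_power_law_log_shape(k, a, k0, b):
    """ln of the two power law form, up to the constant C."""
    first = 1.0 + 1.0 / a - b / (a * k0)
    second = b / (a * k0)
    return -first * log(k + k0 - 1.0) - second * log(k + 1.0)

a, k0, b = 0.436, 65.6, 12.4
exact_drop = exact_log_shape(2000, a, k0, b) - exact_log_shape(1000, a, k0, b)
approx_drop = (two_power_law_log_shape(2000, a, k0, b)
               - two_power_law_log_shape(1000, a, k0, b))
```

Comparing differences of logarithms rather than the values themselves removes the unknown constant C from the comparison.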



7 Conclusions 

Compelled by a significant inhomogeneity in the data, we have created a model
that provides an excellent description of the SPIRES database. It is obvious that
the death mechanism ($b \neq 0$) is essential for describing the live and dead
populations separately, but less clear that it is indispensable when it comes to
the total data. Fitting the total distribution with a preferential attachment
only model (b = 0) results in a = 0.528 and $k_0 = 13.22$, with an rms fractional
error of 33.6%. This fit displays systematic deviations from the data, but
considering that the fit ignores important correlations in the dataset, the
overall quality is rather high. The important lesson to learn from the work in
this paper is that even a high quality fit to the global network distributions
is not necessarily an indication of the absence of additional correlations in
the data.

The most significant difference between the full live-dead model and the
preferential attachment only model described above is expressed in the value of
the parameter $k_0$. The value of this parameter changes by a factor of
approximately 5, from 65.6 to 13.2. It strikes us as natural that preferential
attachment will not be important until a paper is sufficiently visible for
authors to cite it without reading it. We thus believe that $k_0 \approx 66$ is a
more intuitively appealing value for the onset of preferential attachment.
However, independent of which value of the $k_0$ parameter one prefers, the
comparison of these two models clearly demonstrates the danger of assigning
physical meaning to even the most physically motivated parameters if a network
contains unidentified correlations or if known correlations are neglected in the
modeling process. Specifically, it would be ill advised to draw strong
conclusions about the onset of preferential attachment if the death mechanism is
not included in the model.

In summary, the live and dead papers in the SPIRES database constitute
distributions with significantly different statistical properties. We have
constructed a model which includes modified preferential attachment and the
death of nodes. This model is quantitatively successful in describing the
citation distributions for live and dead papers. The resulting model has also
been shown to produce a two power law structure. This structure provides an
appealing link to the work in (Lehmann et al., 2003), where a two power law
structure was adopted to characterize the form of the SPIRES data without any
theoretical support. Finally, we have shown that even in the absence of
preferential attachment, the death mechanism alone can result in power laws.
Since many real world networks have a large number of inactive nodes and only a
small fraction of active nodes, we are confident that this mechanism will find
more general use.